Cache various statistics to improve performance #204

Draft: wants to merge 15 commits into main
Conversation

@JCGoran JCGoran commented Jul 21, 2024

An implementation that resolves #201.

Describe your changes

This is a rather large-looking PR (only because I am basing it off of #198 for ease of merging!), and it is a first attempt (which is why I am marking it as a draft) at caching the various statistics properties as described in #201.

TL;DR: since at each iteration we only move one point in the dataset (basically map $(x_i, y_i) \mapsto (x'_i, y'_i)$ ), why not compute the new statistics ($\langle X' \rangle$, $\text{Var}(X') $, and $r(X', Y') $) in terms of the old ones ($\langle X \rangle$, $\text{Var}(X) $, and $r(X, Y) $) + the old ($(x_i, y_i) $) and new ($(x'_i, y'_i) $) points? This is essentially what this PR does.

Details

Mathematical derivation

The mean is shifted as:

$$ \langle X' \rangle = \langle X \rangle + \frac{\delta_x}{n} $$

where $\delta_x = x'_i - x_i$ (and analogously for $Y$); the variance (or, equivalently, the standard deviation) is shifted as:

$$ \text{Var}(X') = \text{Var}(X) + 2 \frac{\delta_x}{n} (x_i - \langle X \rangle) + \frac{\delta_x^2}{n} - \frac{\delta_x^2}{n^2} $$

while the correlation coefficient is shifted as:

$$ r(X', Y') = \frac{\langle X Y \rangle + \frac{1}{n}(\delta_x y_i + \delta_y x_i + \delta_x \delta_y) - \langle X' \rangle \langle Y' \rangle}{\sqrt{\text{Var}(X') \text{Var}(Y')}} $$

where the only new quantity to keep track of is $\langle X Y \rangle$ (everything else follows from the two updates above).
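For concreteness, here is a minimal, self-contained sketch of these $O(1)$ update rules. The function and dictionary keys are purely illustrative and are not the API used in this PR; the quantities are the population (ddof=0) statistics matching the derivation above:

```python
import math


def perturb_stats(stats, x_i, y_i, x_new, y_new, n):
    """Illustrative O(1) update of cached statistics after moving one point.

    Not the actual API from this PR: `stats` is just a dict holding the
    population (ddof=0) quantities <X>, <Y>, Var(X), Var(Y), and <XY>.
    """
    dx, dy = x_new - x_i, y_new - y_i

    # <X'> = <X> + dx/n (and likewise for Y)
    mean_x = stats["mean_x"] + dx / n
    mean_y = stats["mean_y"] + dy / n

    # Var(X') = Var(X) + 2*dx/n*(x_i - <X>) + dx^2/n - dx^2/n^2
    var_x = stats["var_x"] + 2 * dx / n * (x_i - stats["mean_x"]) + dx**2 / n - dx**2 / n**2
    var_y = stats["var_y"] + 2 * dy / n * (y_i - stats["mean_y"]) + dy**2 / n - dy**2 / n**2

    # <X'Y'> = <XY> + (dx*y_i + dy*x_i + dx*dy)/n
    mean_xy = stats["mean_xy"] + (dx * y_i + dy * x_i + dx * dy) / n

    # r(X', Y') = (<X'Y'> - <X'><Y'>) / sqrt(Var(X') Var(Y'))
    corrcoef = (mean_xy - mean_x * mean_y) / math.sqrt(var_x * var_y)

    return {
        "mean_x": mean_x, "mean_y": mean_y,
        "var_x": var_x, "var_y": var_y,
        "mean_xy": mean_xy, "corrcoef": corrcoef,
    }
```

Computing the statistics from scratch once at the start and then applying this update per iteration is what removes the repeated $O(n)$ passes over the data.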

Workflow

With the math out of the way, this is how the new way of computing the statistics works:

  • read the (x, y) dataset and make an instance of Statistics from it (the name is somewhat ambiguous and subject to change)
  • start the iteration
  • Statistics has a method, perturb, which takes 3 arguments: the row (index) we wish to perturb and the 2 perturbation values (in the x and y directions), and returns an instance of the new SummaryStatistics (i.e. the one computed from the perturbed data)
  • if _is_close_enough returns True, we call the perturb method again (it's $O(1)$, so not very expensive to call), but this time with update=True, which actually updates the data in the Statistics instance
  • go to the next iteration

Note that there are a number of new methods (shifted_mean, shifted_var, shifted_stdev, shifted_corrcoef), which we may or may not want to make private (or refactor so they accept different arguments), as I didn't put too much thought into the overall API design (again, hence the "draft" status).
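For concreteness, here is a rough sketch of how the morphing loop could use this. The names Statistics, perturb, update, and _is_close_enough come from the description above; everything else (the RNG, the perturbation scale, target_stats, num_iterations) is made up for illustration and is not the actual data_morph code:

```python
import numpy as np

rng = np.random.default_rng(42)
num_iterations = 100_000  # illustration only

# `df` is the (x, y) dataset (a pandas DataFrame) and `target_stats` is whatever
# _is_close_enough compares against; both come from the surrounding code.
stats = Statistics(df)  # caches <X>, <Y>, Var(X), Var(Y), <XY> once, in O(n)

for _ in range(num_iterations):
    row = int(rng.integers(len(df)))         # index of the point to move
    dx, dy = rng.normal(scale=0.1, size=2)   # proposed perturbation

    # O(1): summary statistics of the perturbed dataset; cached state untouched
    candidate = stats.perturb(row, dx, dy)

    if _is_close_enough(target_stats, candidate):
        # accept the move: update the dataframe and commit the cached statistics
        df.loc[row, ["x", "y"]] += (dx, dy)
        stats.perturb(row, dx, dy, update=True)
```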

Performance

Now, onto the performance results! Note that the benchmarks were run with #201 already included, so the absolute numbers differ from the ones on main, but the reduction in the number of function calls is evident. I tested everything using `python -m cProfile -m data_morph --seed 42 --start-shape panda --target-shape [shape]`.

```
# star shape - perf before
79257097 function calls (78610163 primitive calls) in 43.075 seconds
# star shape - perf after
54021781 function calls (52470658 primitive calls) in 28.704 seconds
# rings shape - perf before
94095798 function calls (93420792 primitive calls) in 43.998 seconds
# rings shape - perf after
68846080 function calls (67219947 primitive calls) in 29.592 seconds
# bullseye shape - perf before
82412199 function calls (81796202 primitive calls) in 40.245 seconds
# bullseye shape - perf after
57109757 function calls (55635954 primitive calls) in 26.200 seconds
```

so we make roughly 26.1M fewer primitive function calls for each of the shapes in question, which results in roughly 33-35% faster runs (computed as ((t_old - t_new) / t_old) * 100).
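For reference, the percentages follow directly from the wall times reported by cProfile above:

```python
# Wall times (seconds) from the cProfile runs above: (before, after)
timings = {
    "star": (43.075, 28.704),
    "rings": (43.998, 29.592),
    "bullseye": (40.245, 26.200),
}
for shape, (t_old, t_new) in timings.items():
    print(f"{shape}: {(t_old - t_new) / t_old * 100:.1f}% faster")
# star: 33.4% faster
# rings: 32.7% faster
# bullseye: 34.9% faster
```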

Discussion items

Some possible points of discussion:

  • I didn't pay too much attention to Bessel's correction in the derivation of the formulas above, though I imagine the numerical difference should be mostly negligible since all of the starting datasets have > 100 points (see the quick check after this list)
  • API design (make shifted_* functions private? Could perturb do without an update argument? etc.)
  • the tests for shifted_* functions are a bit repetitive, and could take a while to run (I was initially very skeptical of the numerical stability, but it appears to be fine, though it could just be that I haven't tried enough scenarios to hit an instability)
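Regarding the Bessel's correction point, a quick back-of-the-envelope check (my own illustration, not code from the PR): the sample variance exceeds the population variance by exactly a factor of $n/(n-1)$, i.e. a relative difference of $1/(n-1)$, which stays below 1% once $n > 100$:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=150)  # comparable in size to the smallest starting datasets

var_pop = np.var(data)             # ddof=0, as used in the derivation above
var_sample = np.var(data, ddof=1)  # with Bessel's correction
print((var_sample - var_pop) / var_pop)  # exactly 1/(n-1), ~0.0067 here
```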

Feedback is welcome :)

Checklist

  • Test cases have been modified/added to cover any code changes.
  • Docstrings have been modified/created for any code changes.
  • All linting and formatting checks pass (see the contributing guidelines for more information).

@github-actions github-actions bot added the testing (Relating to the testing suite), shapes (Work relating to shapes module), data (Work relating to data module), plotting (Work relating to plotting module), and morpher (Work relating to morpher module) labels on Jul 21, 2024
The implementation uses an AVL tree to keep track of the low and high
parts of the input array, and then updates the trees accordingly in
O(log n) time (instead of O(n) provided by numpy.median).
@github-actions github-actions bot added the packaging (Relating to packaging) label on Nov 10, 2024

codecov bot commented Nov 11, 2024

Codecov Report

Attention: Patch coverage is 96.47059% with 9 lines in your changes missing coverage. Please review.

Project coverage is 98.21%. Comparing base (5d0486b) to head (656c96d).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/data_morph/data/stats.py | 91.34% | 6 Missing and 3 partials ⚠️ |
Additional details and impacted files


```diff
@@            Coverage Diff             @@
##             main     #204      +/-   ##
==========================================
- Coverage   98.47%   98.21%   -0.27%
==========================================
  Files          43       43
  Lines        1839     2075     +236
  Branches      114      129      +15
==========================================
+ Hits         1811     2038     +227
- Misses         25       31       +6
- Partials        3        6       +3
```
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/data_morph/data/dataset.py | 73.75% <100.00%> (+0.67%) ⬆️ |
| src/data_morph/morpher.py | 100.00% <100.00%> (ø) |
| src/data_morph/plotting/static.py | 100.00% <100.00%> (ø) |
| tests/data/test_stats.py | 100.00% <100.00%> (ø) |
| tests/test_morpher.py | 100.00% <100.00%> (ø) |
| src/data_morph/data/stats.py | 91.66% <91.34%> (-8.34%) ⬇️ |

@JCGoran JCGoran (Contributor, Author) commented Nov 12, 2024

Regarding the median: I was not satisfied with the performance of pure Python implementations of sorted containers, which have $O(\log n)$ time complexity for the various operations (add, remove, search), so I decided to implement my own in C++ (std::map is basically a sorted version of Python's dict, with the tradeoff that operations are $O(\log n)$ rather than $O(1)$), with bindings to Python using nanobind.
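The PR itself keeps two trees (the low and high halves of the data) in C++, as described in the commit message above. Purely to illustrate the O(log n)-per-update idea in Python (this is not the implementation in the PR, and it relies on a pure-Python sorted container, which is exactly the kind of implementation the comment above found too slow in practice), a single sorted container already gives a running median:

```python
from sortedcontainers import SortedList  # third-party: pip install sortedcontainers


class RunningMedian:
    """Maintain the median of a dataset under single-point updates."""

    def __init__(self, values):
        self._data = SortedList(values)  # O(n log n) build

    def replace(self, old, new):
        # one removal + one insertion, roughly O(log n) each
        self._data.remove(old)
        self._data.add(new)

    @property
    def median(self):
        n = len(self._data)
        mid = n // 2
        if n % 2:
            return self._data[mid]
        return 0.5 * (self._data[mid - 1] + self._data[mid])
```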

Successfully merging this pull request may close the following issue: Performance optimizations of various statistics